Item Calibration Methods with Multiple Subscale Multistage Testing


Chun Wang
University of Washington 
Ping Chen 
Beijing Normal University 
Shengyu Jiang 
University of Minnesota 


Correspondence concerning this manuscript should be addressed to Chun Wang at: 


312 Miller Hall 
Measurement and Statistics
College of Education, University of Washington
2012 Skagit Ln, Seattle, WA 98105
e-mail: wang4066@uw.edu


Acknowledgement: This research was supported by the IES R305D170042 (R305D160010) and NSF SES-165932. The authors would like to especially thank Drs. Yue Helena Jia, Paul Jewsbury, and Meng Wu for consolidating the research idea, Dr. Jing Chen from NCES for agreeing to share the data, and David Freund for preparing the real data.


Citation: Wang, C., Chen, P., & Jiang, S. (2019). Item calibration methods with multiple subscale multistage testing. Journal of Educational Measurement. https://doi.org/10.1111/jedm.12241
Related code can be downloaded at: https://sites.uw.edu/pmetrics/publications-and-source-code/


Item Calibration Methods with Multiple Subscale Multistage Testing 


Abstract 
Many large-scale educational surveys have moved from linear form design to multistage testing (MST) design. One advantage of MST is that it can provide more accurate latent trait ($\theta$) estimates using fewer items than required by linear tests. However, MST generates incomplete response data by design; hence questions remain as to how to calibrate items using the incomplete data from MST design. Further complication arises when there are multiple correlated subscales per test, and when items from different subscales need to be calibrated according to their respective score reporting metric. The current calibration-per-subscale method produced biased item parameters, and there is no available method for resolving the challenge. Deriving from the missing data principle, we showed that when calibrating all items together, Rubin's (1976) ignorability assumption is satisfied such that the traditional single-group calibration is sufficient. When calibrating items per subscale, we proposed a simple modification to the current calibration-per-subscale method that helps reinstate the missing-at-random assumption and therefore corrects for the estimation bias that is otherwise existent. Three mainstream calibration methods are discussed in the context of MST: the marginal maximum likelihood estimation (MML), the expectation maximization (EM) method, and the fixed parameter calibration (FPC). An extensive simulation study is conducted and a real data example from NAEP is analyzed to provide convincing empirical evidence.


Key words: multistage testing, missing data, marginal maximum likelihood, EM 


1. Introduction 

With the advent of web-based technology, computer-based testing (a.k.a. online testing) is becoming the mainstream form of large-scale educational assessments. The landscape of educational assessment is changing rapidly with the growth of computer-administered tests. As an example, the National Assessment of Educational Progress (NAEP), the “largest nationally representative and continuing assessment” (e.g., Beaton & Zwick, 1992), has recently moved from paper-based assessment (PBA) to digitally based assessment (DBA).

A particular mode of DBA that NAEP has piloted for Mathematics is multistage testing (MST), which refers to a testing format where “subsets of test items are presented to students based on item difficulty and student performance” (Governing Board and NAEP Resources). Figure 1 illustrates a simple, two-stage MST design. The routing block contains items spread across a typical range of difficulty levels in PBA, and the targeted blocks differ by difficulty: blocks of easy, medium, and hard items.


Figure 1. An illustration of a two-stage MST design used in NAEP 


Compared to the linear form design, the MST design has a profound advantage. That is, due to length constraints and the demands placed on the single set of items, linear form tests may provide little information to certain subgroups (mostly high-achieving or low-achieving subgroups) because there are not enough items with appropriate difficulty levels to measure students in those subgroups. An MST, however, tailors the set of items (i.e., target block) a student sees to the student's individual ability level, so that no student receives too many overly easy or difficult items. Consequently, MST can provide more accurate latent trait ($\theta$) estimates using fewer items than required by PBA (e.g., Weiss, 1982; Wainer, 1990). Moreover, the computer-based nature of MST yields many other advantages, such as new item formats, new types of skills that can be measured, easier and faster data analysis, and richer behavior data collection such as item response time (as part of behavior/process data) (e.g., Wang, Zheng, & Chang, 2014).

Despite these advantages, MST generates incomplete response data by design; hence questions remain as to whether the item calibration procedure for the traditional linear forms (e.g., Mislevy, 1991; Mislevy, Beaton, Kaplan, & Sheehan, 1992) can still apply. Widely used calibration methods for linear forms include the marginal maximum likelihood estimation with expectation maximization implementation (MMLE/EM; Bock & Aitkin, 1981), the expectation maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977; Woodruff & Hanson, 1996), and the fixed parameter calibration (FPC) methods (Ban, Hanson, Wang, Yi, & Harris, 2001; Chen & Wang, 2016; Chen, Wang, Xin, & Chang, 2017; Kim, 2006). A default assumption made by all three methods is that the sample is drawn from a single population, although the multiple-group versions of all three methods have also been developed (Lissitz, Jiao, Li, et al., 2014). While MMLE/EM often assumes $\theta$ follows a normal distribution, both EM and FPC allow more flexible $\theta$ distributions.

Given the MST design in Figure 1, students routed to each module in the second stage naturally form three separate subgroups, whose $\theta$ distributions differ. The three non-equivalent groups share the same routing block, which serves as the linkage to put all items on the same scale. In this regard, it seems intuitively reasonable to assume that there are multiple subgroups and that the subgroup structure should be taken into account during the calibration procedure. Indeed, several recent studies (Cai, Roussos, & Wang, 2018; Lu, Jia, & Wu, 2017) have explored the multiple-group MML method for MST item calibration. Their results showed that the multiple-group MML performed poorly, yielding large item parameter bias, whereas the single-group MML performed well. However, no clear reason was provided to explain the results.

In addition, another layer of complexity arises when the assessment covers multiple content subdomains. For instance, the mathematics assessment in NAEP has five subscales: “Number properties and operations”, “Measurement”, “Geometry”, “Data analysis, Statistics, and Probability”, and “Algebra”. For score reporting purposes, items from each subscale need to be calibrated on their respective scale. Traditionally, the item calibration on each subscale is conducted separately using the unidimensional item response theory (IRT) models and then a composite score, which is a weighted combination of the subscale scores, is created to report the overall mathematics performance. However, this calibration-per-subscale approach failed to recover item parameters properly within the MST design (e.g., Lu, Jia, & Wu, 2018; Wu & Lu, 2017; Wu & Xi, 2017), and no viable alternative was provided.

To sum up, there are two scenarios where MST item calibration has been explored: the first one is when all items are put on a single unidimensional scale, and the second one is when items from different content subdomains are put on separate unidimensional scales. The aim of the paper is two-fold: (1) to provide reasons why the current MST item calibration approaches are unsuccessful, including the multiple-group MML (Cai et al., 2018; Lu et al., 2017) for the first scenario and the single-group calibration-per-subscale for the second scenario (Wu & Xi, 2017; Wu & Lu, 2017); and (2) to propose a new method that resolves the challenge in the second scenario. The proposed solution is grounded in Rubin's (1976) missing data theory, which provides a streamlined framework to explore the MST item calibration for both scenarios. Please note that the second scenario was motivated by the operational NAEP analysis, but the solutions provided could be applied to other operational designs similar to NAEP.

Footnote: https://nces.ed.gov/nationsreportcard/tdw/analysis/trans.aspx

The rest of the paper is organized as follows. We first briefly introduce the unidimensional two-parameter logistic (2PL) model as the underlying IRT model throughout the study. Then we describe the three commonly used item calibration methods: MMLE/EM, EM, and FPC. All these methods could be used with the MST data. In the next section, we introduce Rubin's (1976) missing data theory and its application to the MST design. In particular, we explain why the current calibration-per-subscale method with the MST design is inadequate, and present a new, simple solution. Two simulation studies are presented, followed by a real data illustration. A discussion is presented at the end.
2. Models 

The unidimensional 2PL model is used throughout the paper. For 2PL, the item response function for item $j$ takes the following form

$$P_j(\theta) = \frac{1}{1 + \exp[-1.7 a_j (\theta - b_j)]}, \tag{1}$$

where subscript $j$ indicates item, $a_j$ and $b_j$ denote item discrimination and difficulty parameters respectively, and $\theta$ denotes the latent trait measured by the test. Here “1.7” is a scaling factor to equate the logistic form with the normal ogive form.
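As a minimal illustration of Eq. (1), the item response function can be written in a few lines of Python. The function name `p_2pl` and the constant name `D` are labels chosen for this sketch, not names from the paper or its released code.

```python
import math

D = 1.7  # scaling factor from Eq. (1)


def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model in Eq. (1)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


# When ability equals item difficulty, the probability is exactly one half.
print(p_2pl(0.0, 1.0, 0.0))
# A more discriminating item separates examinees around b more sharply.
print(p_2pl(0.5, 0.6, 0.0), p_2pl(0.5, 1.2, 0.0))
```

The probability increases in $\theta$ for any positive $a_j$, and $b_j$ is the point on the $\theta$ scale where it crosses .5.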


3. Existing Item Calibration Methods 


3.1 Marginal Maximum Likelihood Estimation/Expectation Maximization (MMLE/EM) 


The MMLE/EM algorithm (or MML for short hereafter) for IRT parameter estimation when the response data are complete has been well established in the literature (e.g., Bock & Aitkin, 1981; Mislevy, 1984). Suppose a $J$-item test is given to $N$ examinees, resulting in an $N$-by-$J$ binary response matrix $Y$. Assuming all items are modelled by 2PL, let $\Delta = (a, b)$ denote the set of unknown item parameters, which are the target parameters in item calibration. The joint likelihood can be easily written as

$$L(\Delta, \theta \mid Y) = \prod_{i=1}^{N} \prod_{j=1}^{J} P_j(\theta_i)^{y_{ij}} [1 - P_j(\theta_i)]^{1 - y_{ij}}, \tag{2}$$

due to the local independence assumption. Let $P(y_i \mid \theta_i, \Delta) = \prod_{j=1}^{J} [P_j(\theta_i)^{y_{ij}} (1 - P_j(\theta_i))^{1 - y_{ij}}]$ denote the joint probability of $y_i$ for notational simplicity. Then the marginal likelihood of $\Delta$ is

$$L(\Delta \mid Y) = \prod_{i=1}^{N} \int P(y_i \mid \theta, \Delta)\, g(\theta \mid \mu_\theta, \sigma_\theta^2)\, d\theta, \tag{3}$$

where $g(\theta \mid \mu_\theta, \sigma_\theta^2)$ denotes the density function of $\theta$ in the population, and $\mu_\theta$ and $\sigma_\theta^2$ are its mean and variance respectively. Here in Eq. (3) it is assumed that there is one population from which the sample is drawn. However, the MML method could also be generalized to the multiple-group scenario such that the population mean and variance will be group specific (e.g., Mislevy et al., 1992; Cai, Yang, & Hansen, 2011).
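The marginal likelihood in Eq. (3) is usually evaluated numerically. The sketch below approximates the integral with an equally spaced grid of quadrature points under a standard normal density. The grid size, the range of the grid, and all function names are assumptions made for illustration; they are not taken from the paper.

```python
import math

D = 1.7


def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def normal_grid(n_points=41, lo=-4.0, hi=4.0):
    """Equally spaced points with normalized standard normal weights."""
    step = (hi - lo) / (n_points - 1)
    nodes = [lo + k * step for k in range(n_points)]
    dens = [math.exp(-0.5 * t * t) for t in nodes]
    total = sum(dens)
    return nodes, [d / total for d in dens]


def marginal_loglik(Y, a, b, nodes, weights):
    """Log of Eq. (3) for a complete binary response matrix Y (list of rows)."""
    loglik = 0.0
    for y in Y:
        person = 0.0
        for t, w in zip(nodes, weights):
            like = 1.0
            for y_ij, a_j, b_j in zip(y, a, b):
                p = p_2pl(t, a_j, b_j)
                like *= p if y_ij == 1 else 1.0 - p
            person += like * w  # quadrature approximation of the integral
        loglik += math.log(person)
    return loglik


nodes, weights = normal_grid()
print(marginal_loglik([[1, 0], [1, 1], [0, 0]], [1.0, 0.8], [0.0, 0.5], nodes, weights))
```

Maximizing this function over all $a_j$ and $b_j$ at once is the direct, high-dimensional search that the EM implementation avoids.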

To remove the scale indeterminacy inherent in the IRT models, in Eq. (3), one often assumes that the latent trait $\theta$ follows a standard normal distribution (i.e., $g(\theta \mid \mu_\theta = 0, \sigma_\theta^2 = 1)$). The marginal likelihood in Eq. (3) cannot be directly maximized easily because there is no closed-form solution for $\Delta$, and finding a numerical solution means searching in a $2 \times J$-dimensional space. The EM algorithm, however, provides a viable computational tool to simplify the direct maximization of the marginal likelihood (Bock & Aitkin, 1981).


In essence, the EM algorithm alternates between the E-step and M-step. In the E-step, the conditional expectation of the complete data log-likelihood (i.e., $l(\Delta \mid Y, \theta) = \log(L(\Delta \mid Y, \theta))$) with respect to the missing data (in this case, $\theta$) is obtained, denoted as

$$E_{\theta \mid Y, \Delta^r}\big(l(\Delta \mid Y, \theta)\big) = E_{\theta \mid Y, \Delta^r}\big(\log(L(\Delta \mid Y, \theta))\big), \tag{4}$$

where $\Delta^r$ denotes the parameter estimates from the $r$th iteration. The notation $E_{\theta \mid Y, \Delta^r}$ implies that the expectation is taken with respect to the conditional distribution of $\theta$ (i.e., missing data) given the observed data ($Y$) and provisional parameter estimates, $P(\theta \mid Y, \Delta^r)$. This conditional expectation is maximized to obtain the MLE of $\Delta$ in the M-step. This way, the $2 \times J$-dimensional maximization challenge is reduced to searching for a numerical solution in a 2-dimensional space for each item, which is much more feasible.
3.2 Expectation Maximization (EM) algorithm 

While the above MML method treats the EM algorithm as a tool to reduce the computational complexity of directly maximizing the marginal likelihood, the item calibration can also proceed directly from the principal idea of the EM algorithm (Bock & Aitkin, 1981; Dempster et al., 1977; Rubin, 1991; Rubin & Thayer, 1982). In this case, the unknown latent trait $\theta$ is considered “missing” data. To model the distribution of $\theta$ flexibly, let us consider the discrete values $\theta_k$ ($k = 1, \ldots, K$) and their associated unknown probabilities $\pi_k$ ($k = 1, \ldots, K$) (Kim, 2006). Here $K$ is the total number of quadrature points along the $\theta$ continuum. Under this assumption, the $\theta$ distribution can be recovered via the probability mass function, where $\sum_{k=1}^{K} \pi_k = 1$. In this regard, both the item parameters $\Delta = (a, b)$ and $\pi = (\pi_1, \ldots, \pi_K)$ are unknown parameters. The IRT latent scale can be fixed by setting the mean of $\theta$ at 0, i.e., $\sum_{k=1}^{K} \pi_k \theta_k = 0$, and by setting its variance at 1.


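The EM algorithm again proceeds by alternating between the E-step and the M-step. Here, the conditional expectation is slightly different from Eq. (4) as follows,

$$E_{\theta \mid Y, \Delta^r, \pi^r}\big(\log(L(\Delta, \pi \mid Y, \theta))\big) = \int \log(L(\Delta, \pi \mid Y, \theta))\, P(\theta \mid Y, \Delta^r, \pi^r)\, d\theta = \sum_{i=1}^{N} \sum_{k=1}^{K} \log\big(L(\Delta \mid Y_i, \theta_k)\, \pi_k\big) \times P(\theta_k \mid Y_i, \Delta^r, \pi^r), \tag{5}$$

where $P(\theta_k \mid Y_i, \Delta^r, \pi^r)$ is the posterior distribution of $\theta_k$ given $Y_i$, $\Delta^r$ and $\pi^r$. Then in the M-step, the conditional expectation is maximized with respect to both $\Delta$ and $\pi$. Solving for item parameters remains the same as in section 3.1, whereas $\pi_k$ is updated via a simple, closed-form solution as follows

$$\pi_k^{r+1} = \frac{f_k^r}{N}, \tag{6}$$

where $f_k^r = \sum_{i=1}^{N} P(\theta_k \mid Y_i, \Delta^r, \pi^r)$. Within each EM cycle, to fix the latent scale, a few standardization steps need to be in place. In particular, let $\mu^{r+1} = \sum_{k} \pi_k^{r+1} \theta_k^{r+1}$ and $(\sigma^{r+1})^2 = \sum_{k} \pi_k^{r+1} (\theta_k^{r+1} - \mu^{r+1})^2$ be the provisional mean and variance of $\theta$, then the discrete quadrature points are standardized by updating $\theta_k^{r+1}$ with $(\theta_k^{r+1} - \mu^{r+1}) / \sigma^{r+1}$. Accordingly, the provisional item parameter estimates are updated as follows: $a^{r+1}$ is updated with $a^{r+1} \times \sigma^{r+1}$, $b^{r+1}$ is updated with $(b^{r+1} - \mu^{r+1}) / \sigma^{r+1}$, and $\pi^{r+1}$ is updated with $\pi^{r+1} \times \sigma^{r+1}$.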


Although in the above exposition $\pi$ is estimated for a single $\theta$ distribution, the EM algorithm can also be extended for multiple-group calibration. That is, group specific $\pi$'s could be estimated for each subpopulation separately. One advantage of the EM algorithm compared to MML is that the distribution of $\theta$ does not have to be specified in advance, and hence it is more flexible in dealing with non-normal $\theta$ distributions. This is especially desirable in the multiple-group calibration approach when the group specific $\theta$ distributions are unknown.
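The bookkeeping in this section can be sketched as follows: posterior weights over the quadrature points, the closed-form update of the probabilities in Eq. (6), and the linear rescaling that keeps the latent scale at mean 0 and variance 1. This is a simplified sketch under stated assumptions: responses are complete, the M-step for the item parameters is omitted, and the function names are hypothetical. The sketch treats the $\pi_k$ as probability masses attached to the points, so it leaves them unchanged when the points are rescaled; the additional rescaling of $\pi$ mentioned in the text is not reproduced here.

```python
import math

D = 1.7


def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def posterior_weights(y, a, b, nodes, pi):
    """P(theta_k | y_i, provisional item parameters, provisional pi)."""
    post = []
    for t, w in zip(nodes, pi):
        like = w
        for y_ij, a_j, b_j in zip(y, a, b):
            p = p_2pl(t, a_j, b_j)
            like *= p if y_ij == 1 else 1.0 - p
        post.append(like)
    total = sum(post)
    return [v / total for v in post]


def update_pi(Y, a, b, nodes, pi):
    """Eq. (6): average the posterior weights over the N examinees."""
    f = [0.0] * len(nodes)
    for y in Y:
        for k, v in enumerate(posterior_weights(y, a, b, nodes, pi)):
            f[k] += v
    return [f_k / len(Y) for f_k in f]


def standardize(nodes, pi, a, b):
    """Shift and rescale the points to mean 0, variance 1; move a and b along."""
    mu = sum(p * t for p, t in zip(pi, nodes))
    sigma = math.sqrt(sum(p * (t - mu) ** 2 for p, t in zip(pi, nodes)))
    new_nodes = [(t - mu) / sigma for t in nodes]
    new_a = [a_j * sigma for a_j in a]
    new_b = [(b_j - mu) / sigma for b_j in b]
    return new_nodes, new_a, new_b
```

The rescaling leaves every $a_j(\theta_k - b_j)$ unchanged, so the fitted response probabilities are the same before and after standardization; only the metric of the scale moves.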
3.3 Fixed Parameter Calibration (FPC) 

Fixed parameter calibration refers to fixing a subset of item parameters at their previously estimated values and calibrating the remaining items so that their item parameters are placed on the same, fixed scale. In this case, the scale of $\theta$ is naturally determined via the fixed parameters, and hence no constraints need to be added. This method is often used in the online calibration scenario where new items are calibrated while holding the operational item parameters as fixed (e.g., Chen & Wang, 2016; Kim, 2006). Both the aforementioned MML and EM methods can be used in FPC. With the former method, if the distribution is assumed normal, then its mean and variance can be freely estimated; whereas with the latter method, the standardization steps are no longer needed. For more details, please refer to Chen et al. (2017) or Kim (2006). In this paper, we will consider the EM algorithm coupled with FPC such that the $\theta$ distribution does not have to be pre-specified. Moreover, FPC can also be used with both single-group and multiple-group calibration approaches (Kim & Kolen, 2016).


4. Item Calibration with Missing Data 
By nature, multistage testing generates incomplete response data because after the routing stage, each examinee is routed to one module in the remaining stages that matches closely with his/her ability level. Mislevy and Sheehan (1989) showed that in incomplete designs, the use of MML could be justified from Rubin's (1976) general theory on inference in the presence of missing data. In particular, Mislevy and Wu (1996) argued that missing data due to MST (or adaptive) testing can be ignored when making inference about $\theta$ because the chance for an item to be missing depends on observed responses from previous items but not on unobserved responses. However, they did not discuss the impact of missing data on item calibration. Eggen and Verhelst (2011) first provided a brief justification of using MML in the MST item calibration, but they did not mention the scenario when the test contains multiple subscales. In this section, we intend to provide a comprehensive discussion with regard to the missing data mechanism of the MST design within Rubin's (1976) framework, especially the implications of missing data on item calibration when the test contains multiple subscales. Please note that for simplicity of exposition, we assume there is only one form per module in the MST design in this paper. However, in practice, there are oftentimes multiple, parallel forms per module. Because the parallel forms are usually randomly assigned, the missing data resulting from this random assignment are completely random and hence can be ignored.

Essential to Rubin's (1976) theory is the stochastic nature of the missing data mechanism (Little & Rubin, 1987), denoted as

$$h_\phi(M = m \mid Y = y), \tag{7}$$

where $M = (M_1, \ldots, M_J)$ is the missing data indicator, indicating whether $Y_j$ is actually observed (i.e., $m_j = 1$) or missing (i.e., $m_j = 0$), and $y_j$ is the response on item $j$. Eq. (7) defines the process that causes the missing data, with the parameter $\phi$ that governs the missing mechanism.

In the incomplete design, we have a sample realization of $M$ and $Y_{obs}$ (“obs” denotes observed responses). So we can only estimate the item parameters of interest (i.e., $\Delta$) based on the partially observed $Y$, that is, on the marginal joint distribution of $M$ and $Y_{obs}$,

$$\int f_\phi(m, y \mid \Delta)\, dy_{mis} = \int f(y \mid \Delta)\, h_\phi(m \mid y)\, dy_{mis}, \tag{8}$$

where $f_\phi(m, y \mid \Delta)$ is the joint distribution of the complete data (i.e., $Y = (Y_{obs}, Y_{mis})$) and the missing indicators. According to Rubin (1976), if the process that causes missing data can be ignored, then Eq. (8) is equivalent to $\int f(y \mid \Delta)\, dy_{mis} = f(y_{obs} \mid \Delta)$, implying that the parameter of interest, $\Delta$, can be inferred directly from the observed data.

Rubin (1976) provides sufficient conditions under which ignoring the missing data mechanism still yields correct direct likelihood inference about $\Delta$. The conditions are: (1) the missing at random (MAR) assumption is satisfied, i.e., for each value of $\phi$, $h_\phi(m \mid y_{obs}, y_{mis}) = h_\phi(m \mid y_{obs})$ for all values of $y_{mis}$; and (2) the parameter $\phi$ is distinct from $\Delta$, which means that all possible values of $\phi$ are possible in combination with all possible values of $\Delta$.

In what follows, we will discuss the missing mechanisms induced by the two routing rules using Rubin's (1976) framework. One is based on $\hat{\theta}$, which is used in current operational testing, and the other is based on true $\theta$, which is certainly unrealistic. The rationale for considering the latter design is that several previous studies used the multiple-group calibration approach to estimate item parameters from the MST design, but they were unsuccessful (e.g., Lu et al., 2017; Cai et al., 2018). Therefore, we intend to provide a theory-grounded argument that the multiple-group approach is needed only when the routing is based on true $\theta$. This argument is further backed up by the simulation results in section 5.
4.1 Routing based on $\hat{\theta}$

We first consider an MST design where the routing rule is based on the interim $\hat{\theta}$, which is estimated from the responses and the previously known item parameters in the routing block. Under this design, the marginal likelihood of $\Delta$ for person $i$, obtained by integrating out both $y_{i,mis}$ and $\theta$, is

$$\begin{aligned}
\int_\theta \int L_\phi(\theta, \Delta \mid y_{i,obs}, y_{i,mis})\, dy_{i,mis}\, d\theta
&= \int_\theta \int L(\theta, \Delta \mid y_{i,obs}, y_{i,mis})\, h_\phi(m_i \mid y_{i,obs}, y_{i,mis})\, dy_{i,mis}\, d\theta \\
&= \int_\theta \int L(\theta, \Delta \mid y_{i,obs}, y_{i,mis})\, h_\phi(m_i \mid y^{R}_{i,obs})\, dy_{i,mis}\, d\theta,
\end{aligned} \tag{9}$$

where $y^{R}_{i,obs}$ denotes the observed responses on the items in the routing block. Here, $\phi$ contains the pre-specified cut-offs for routing decisions and therefore it is distinct from the target parameters $\Delta$. Because the missing data mechanism only depends on observed data, the MAR assumption is automatically satisfied. Then, the last equality in Eq. (9) holds, and $h_\phi(m_i \mid y^{R}_{i,obs})$ indicates the missing data process that depends on the observed response vector because $\hat{\theta}$ is estimated from $y^{R}_{i,obs}$.

Footnote: The previously known item parameters are the initially estimated item parameters obtained from the previous administrations. The item parameters will be recalibrated with the MST-generated data, which is the routine analysis in NAEP to avoid any aberrances due to item parameter drift.


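Further expand $L(\theta, \Delta \mid y_{i,obs}, y_{i,mis}) = \big[\prod_j P(y_{ij} \mid \theta, \Delta_j)\big]\, g(\theta \mid \mu_\theta = 0, \sigma_\theta^2 = 1)$ in Eq. (9) such that it can be simplified as

$$\int_\theta \Big[\prod_j P(y_{ij,obs} \mid \theta, \Delta_j)\Big]\, g(\theta \mid \mu_\theta = 0, \sigma_\theta^2 = 1)\, h_\phi(m_i \mid y^{R}_{i,obs})\, d\theta. \tag{10}$$

This is because $\int P(y_{i,mis} \mid \theta)\, dy_{i,mis} = 1$. So the MML item calibration method intends to maximize the marginal likelihood

$$\prod_i \int_\theta \Big[\prod_j P(y_{ij,obs} \mid \theta, \Delta_j)\Big]\, g(\theta \mid \mu_\theta = 0, \sigma_\theta^2 = 1)\, h_\phi(m_i \mid y^{R}_{i,obs})\, d\theta \;\propto\; \prod_i \int_\theta \Big[\prod_j P(y_{ij,obs} \mid \theta, \Delta_j)\Big]\, g(\theta \mid \mu_\theta = 0, \sigma_\theta^2 = 1)\, d\theta. \tag{11}$$

When both the MAR assumption and the distinctness assumption are satisfied, Rubin's (1976) ignorability condition is satisfied. Hence, the single-group marginal maximum likelihood (MML) introduced in section 3.1 is sufficient for item calibration in this MST design. Indeed, after taking a log-transformation of Eq. (11), the term $h_\phi(m_i \mid y^{R}_{i,obs})$ is no longer relevant because it does not contain $\Delta$. In this case, maximizing Eq. (11) is equivalent to maximizing Eq. (3).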
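The last step can be written out explicitly. The display below is a worked restatement of the argument, not an additional result. The superscript $R$ is used here as a label for the routing-block responses, and $g(\theta \mid 0, 1)$ abbreviates the standard normal density; both are notational assumptions of this sketch. Because the routing term does not depend on $\theta$, it leaves the integral, and after taking logs it becomes an additive constant with respect to $\Delta$:

```latex
\begin{aligned}
\log \prod_{i=1}^{N} \int_\theta \Big[\prod_j P(y_{ij,obs} \mid \theta, \Delta_j)\Big]\,
  g(\theta \mid 0, 1)\, h_\phi(m_i \mid y^{R}_{i,obs})\, d\theta
&= \sum_{i=1}^{N} \log h_\phi(m_i \mid y^{R}_{i,obs}) \\
&\quad + \sum_{i=1}^{N} \log \int_\theta \Big[\prod_j P(y_{ij,obs} \mid \theta, \Delta_j)\Big]\,
  g(\theta \mid 0, 1)\, d\theta .
\end{aligned}
```

The derivative of the first sum with respect to any item parameter is zero, so the maximizer is determined by the second sum alone, which is the observed-data version of the log of Eq. (3).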
We can also show that for the EM algorithm in section 3.2, when the MAR assumption is satisfied, the EM algorithm can proceed based solely on the observed data. The detailed derivation is provided in the Appendix.
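In computational terms, proceeding on the observed data alone means that items an examinee never saw are simply skipped in that examinee's likelihood. The sketch below marks such items with `None` (a coding convention assumed here, not one prescribed by the paper) and checks numerically that skipping an item gives the same value as summing the likelihood over both of its possible responses, which is the discrete counterpart of integrating the missing responses out.

```python
import math

D = 1.7


def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def person_marginal(y, a, b, nodes, weights):
    """Marginal likelihood of one response vector; None = item not administered."""
    total = 0.0
    for t, w in zip(nodes, weights):
        like = 1.0
        for y_ij, a_j, b_j in zip(y, a, b):
            if y_ij is None:
                continue  # unadministered items contribute a factor of 1
            p = p_2pl(t, a_j, b_j)
            like *= p if y_ij == 1 else 1.0 - p
        total += like * w
    return total


# Hypothetical three-point grid and item parameters, for illustration only.
nodes = [-1.0, 0.0, 1.0]
weights = [0.25, 0.5, 0.25]
a, b = [1.0, 0.8], [0.0, 0.5]

skipped = person_marginal([1, None], a, b, nodes, weights)
summed = person_marginal([1, 0], a, b, nodes, weights) + person_marginal([1, 1], a, b, nodes, weights)
print(skipped, summed)
```

The two printed values agree up to floating-point rounding, which is why no imputation of the unadministered modules is needed under MAR.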


4.2 Routing based on true $\theta$

Several recent studies (Cai et al., 2018; Lu et al., 2018) have used the multiple-group MML method for MST item calibration and found biased parameter estimates. In this section, we will show that, from the missing data principle, the multiple-group calibration approach is only appropriate for a special, unrealistic scenario where the routing is based on true $\theta$. Even so, the group specific $\theta$ distribution also needs to be defined correctly.

In the current practice, multiple-group MML proceeds by assuming the $\theta$ distribution follows a normal $N(\mu_g, \sigma_g^2)$, where $g$ denotes the $g$th group (Cai et al., 2011). There are two commonly used approaches to remove the scale indeterminacy. The first approach is to let the mean and variance for all three groups be estimable parameters with the constraints that the overall mean and standard deviation are 0 and 1 respectively (Lu, Jia, & Wu, 2017). The second approach is to fix the mean and variance of $\theta$ in one group to constants, and let them be freely estimated in all remaining groups.

According to the discussion in section 4.1, when routing is based on $\hat{\theta}$, the ignorability condition is satisfied and hence a single-group MML is sufficient. Using multiple-group MML not only adds estimation complexity due to additional parameters, but it is also based on a false assumption that the $\theta$ distribution for each subgroup follows a normal distribution. This is exactly the reason why the previous studies using the multiple-group calibration were unsuccessful. There is one exception when the multiple-group MML is necessary, namely when the routing decision is made based on true $\theta$. For example, let $c_1$ and $c_2$ be the two cutoffs along the $\theta$ continuum, and now the missing mechanism is

$$h_\phi(m \mid \theta) = \begin{cases}
1_{\theta > c_2}, & \text{if this person takes the difficult block} \\
1_{c_1 \le \theta \le c_2}, & \text{if this person takes the medium block} \\
1_{\theta < c_1}, & \text{if this person takes the easy block.}
\end{cases} \tag{12}$$

The MAR assumption is no longer satisfied because the missing data depend on the unknown latent variable $\theta$, which itself is also missing. Replacing $h_\phi(m_i \mid y^{R}_{i,obs})$ in Eq. (10) by (12) results in a marginal likelihood that is comprised of three components,

$$\begin{aligned}
&\prod_{i \in \text{difficult}} \int_\theta \Big[\prod_j P(y_{ij,obs} \mid \theta, \Delta_j)\Big]\, g(\theta \mid \mu_\theta = 0, \sigma_\theta^2 = 1)\, 1_{\theta > c_2}\, d\theta \\
\times &\prod_{i \in \text{medium}} \int_\theta \Big[\prod_j P(y_{ij,obs} \mid \theta, \Delta_j)\Big]\, g(\theta \mid \mu_\theta = 0, \sigma_\theta^2 = 1)\, 1_{c_1 \le \theta \le c_2}\, d\theta \\
\times &\prod_{i \in \text{easy}} \int_\theta \Big[\prod_j P(y_{ij,obs} \mid \theta, \Delta_j)\Big]\, g(\theta \mid \mu_\theta = 0, \sigma_\theta^2 = 1)\, 1_{\theta < c_1}\, d\theta.
\end{aligned} \tag{13}$$

It is clear from Eq. (13) that a three-group calibration needs to be performed, and each group has a $\theta$ distribution that follows a truncation of a standard normal distribution.
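On a quadrature grid, the three group-specific distributions implied by Eq. (13) can be obtained by zeroing the standard normal weights outside each group's interval and renormalizing. The sketch below uses an evenly spaced grid and the illustrative cut points -0.43 and 0.43; the grid, the function name, and the group labels are assumptions of this sketch.

```python
import math


def truncated_normal_weights(nodes, lower, upper):
    """Standard normal density restricted to [lower, upper], renormalized on the grid."""
    dens = [math.exp(-0.5 * t * t) if lower <= t <= upper else 0.0 for t in nodes]
    total = sum(dens)
    return [d / total for d in dens]


nodes = [-4.0 + 0.1 * k for k in range(81)]
c1, c2 = -0.43, 0.43

group_weights = {
    "easy": truncated_normal_weights(nodes, -math.inf, c1),
    "medium": truncated_normal_weights(nodes, c1, c2),
    "hard": truncated_normal_weights(nodes, c2, math.inf),
}

# Group means on the grid: negative, about zero, and positive respectively.
for name, w in group_weights.items():
    print(name, round(sum(wk * t for wk, t in zip(w, nodes)), 3))
```

None of the three sets of weights resembles a normal density, which illustrates why a multiple-group analysis that assumes group-wise normality is misspecified even in this special case.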


4.3 The challenge of calibration by subscale
Many large-scale assessments such as NAEP or PISA (e.g., Liu, Wilson, & Paek, 2008) measure students' performance on multiple subscales within a given subject domain. The standard practice of NAEP item calibration is to calibrate items from each subscale separately using the traditional single-group MML method (Wu & Lu, 2017; Wu & Xi, 2017). However, this procedure yields biased item parameter estimates when the response data are collected from the MST design (e.g., Lu et al., 2017). Previous studies have neither given a justifiable explanation nor provided a viable solution.


In fact, from the missing data theory, it can be easily verified that when the calibration is conducted per subscale, the MAR assumption is violated. This is because, by design, the missing data mechanism is based on the observed responses from all items in the routing block, i.e., $h_\phi(m_i \mid y^{R}_{i,obs})$. However, if one conducts the calibration per subscale, for subscale $d$, we have the marginal likelihood for person $i$ as follows,

$$\begin{aligned}
&\int_\theta \int L(\theta, \Delta \mid y^{d}_{i,obs}, y^{d}_{i,mis})\, h_\phi(m_i \mid y^{d}_{i,obs}, y^{d}_{i,mis})\, dy^{d}_{i,mis}\, d\theta \\
= &\int_\theta \int L(\theta, \Delta \mid y^{d}_{i,obs}, y^{d}_{i,mis})\, h_\phi(m_i \mid y^{R,d}_{i,obs})\, dy^{d}_{i,mis}\, d\theta.
\end{aligned} \tag{14}$$

In Eq. (14), $y^{R,d}_{i,obs}$ denotes the observed responses from person $i$ on items in subscale $d$ in the routing block. Please note that because the missing data function $h_\phi(m_i \mid y^{R,d}_{i,obs}) \ne h_\phi(m_i \mid y^{R}_{i,obs})$, using (14) will inevitably introduce bias due to the misspecification of the missing data function. Indeed, let $y^{R}_{i,obs} = (y^{R,d}_{i,obs}, y^{R,-d}_{i,obs})$, where $y^{R,-d}_{i,obs}$ denotes the observed responses from person $i$ on all items in the routing block except subscale $d$. If one performs the calibration by subscale via MML following Eq. (14), $y^{R,-d}_{i,obs}$ is considered as “missing” data because it is not used in the calibration. Therefore, the missing data actually depend on the “missing” observations, violating the missing at random assumption. Following this argument, a simple solution is to augment the subscale data $y^{d}_{i,obs}$ by $y^{R,-d}_{i,obs}$, and the MAR assumption will be satisfied such that a single-group MML still applies.
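The proposed modification amounts to a change in which item responses are fed into each per-subscale calibration. A minimal sketch, assuming each item is described by a dictionary with hypothetical keys `id`, `subscale`, and `module`:

```python
def calibration_items(items, subscale, augment=True):
    """Return ids of items whose responses enter the calibration of one subscale.

    augment=False: traditional approach, items of the target subscale only.
    augment=True:  modified approach, which also includes routing-block items
                   from all other subscales.
    """
    selected = []
    for item in items:
        in_subscale = item["subscale"] == subscale
        other_routing = item["module"] == "routing" and not in_subscale
        if in_subscale or (augment and other_routing):
            selected.append(item["id"])
    return selected


# Hypothetical toy item bank with two subscales.
items = [
    {"id": "R1", "subscale": "algebra", "module": "routing"},
    {"id": "R2", "subscale": "geometry", "module": "routing"},
    {"id": "E1", "subscale": "algebra", "module": "easy"},
    {"id": "E2", "subscale": "geometry", "module": "easy"},
    {"id": "H1", "subscale": "algebra", "module": "hard"},
]

print(calibration_items(items, "algebra", augment=False))
print(calibration_items(items, "algebra", augment=True))
```

With the augmented item set, every response that influenced the routing decision is part of the data being modelled, so the routing decision again depends only on observed data.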

Figure 2 provides an illustrative comparison of the traditional calibration per subscale approach and our proposed, modified approach. Assuming the test contains three subscales, for the modified approach, although item responses from the other two subscales in the routing


block are also used in item calibration, the item parameters for those subscales are considered, 


[Figure 2 panels: (a) Traditional approach; (b) Modified approach]

Figure 2. Illustration of calibration per subscale. The three boxes with different colored lines represent three different subscales. If one intends to calibrate item parameters from scale 1 (red color), item responses from the shaded area are used as input.


5. Simulation Studies

Two simulation studies were conducted to evaluate the performance of the different calibration methods under a typical NAEP design. The 2PL model was used throughout the simulation studies because its item parameters tend to be relatively easy to recover, whereas the c-parameter estimation in the 3PL model is known to be challenging (Thissen & Wainer, 1982; Swaminathan & Gifford, 1986).
5.1 Design and Methods 

Item bank and MST design. The items were obtained from the NAEP 2011 Grade 8 mathematics assessment. The item bank was constructed by pooling together items in all five content areas and all testing blocks. There were 115 items in total, from which four testing modules were assembled. For content balancing purposes, the following procedure was conducted on each of the five subscales. First, the items were ranked in ascending order in terms of the discrimination parameter. About 1/4 of the items with the lowest values were chosen to form the routing module. Selecting items with lower $a$ parameters at the beginning of the test is consistent with the suggestions in Chang and Ying (1996). This design not only helps balance item usage but also makes the test more robust to random errors (incorrect answers due to test anxiety) at the beginning of the test (Chang & Ying, 2008). Then, the remaining items were ordered by the difficulty parameter, and an “easy” module was made from the 1/3 of the remaining items with the lowest difficulty. Similarly, the “difficult” module consists of the 1/3 of the remaining items that are most difficult. The final module consists of the last 1/3 of the items, with medium difficulty. Table 1 shows the number of items in each module and each subscale. Although the number of items per subscale differs by test design, they are roughly evenly distributed across the four modules. Table 2 presents the descriptive statistics of the item parameters.

Footnote: The real item parameters were retrieved from https://nces.ed.gov/nationsreportcard/tdw/analysis/scaling_irt_math.aspx
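The assembly procedure described above can be sketched for a single subscale as follows. The dictionary keys and the use of `round` to turn “about 1/4” and “1/3” into item counts are assumptions of this sketch; the text does not state the exact rounding rule.

```python
def assemble_modules(items):
    """Split one subscale's items into routing, easy, medium, and hard modules.

    items: list of dicts with hypothetical keys "id", "a", "b".
    """
    # Step 1: the least discriminating quarter forms the routing module.
    by_a = sorted(items, key=lambda item: item["a"])
    n_routing = round(len(by_a) / 4)
    routing, rest = by_a[:n_routing], by_a[n_routing:]

    # Step 2: the remaining items are split into thirds by difficulty.
    by_b = sorted(rest, key=lambda item: item["b"])
    n_third = round(len(by_b) / 3)
    easy = by_b[:n_third]
    hard = by_b[len(by_b) - n_third:]
    medium = by_b[n_third:len(by_b) - n_third]
    return {"routing": routing, "easy": easy, "medium": medium, "hard": hard}


# Hypothetical subscale with 12 items.
bank = [{"id": i, "a": 0.5 + 0.1 * i, "b": ((i * 7) % 12) / 4.0 - 1.5} for i in range(12)]
modules = assemble_modules(bank)
print({name: [item["id"] for item in mod] for name, mod in modules.items()})
```

Running the function once per subscale and pooling the results by module name would give the four test modules, with content balance following from the per-subscale split.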


Table 1. Item distributions per module and per subscale

| Module  | Number sense, properties, and operations | Measurement | Geometry and spatial sense | Data analysis, statistics, and probability | Algebra and functions | Total |
|---------|------|------|------|------|------|------|
| Routing | 5    | 6    | 7    | 4    | 9    | 31   |
| Easy    | 5    | 5    | 6    | 4    | 9    | 29   |
| Medium  | 4    | 5    | 6    | 3    | 9    | 27   |
| Hard    | 5    | 5    | 6    | 3    | 9    | 28   |
| Total   | 19   | 21   | 25   | 14   | 36   | 115  |


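Table 2. Descriptive statistics of item parameters

| Module  | Mean a | Mean b | SD a | SD b            |
|---------|--------|--------|------|-----------------|
| Routing | 0.63   | 0.01   | 0.14 | 1.24            |
| Easy    | 1.05   | -0.34  | 0.24 | [illegible]     |
| Medium  | 1.08   | 0.46   | 0.27 | 0.38            |
| Hard    | 1.21   | 1.24   | 0.37 | 0.4[illegible]  |
| Total   | 0.98   | 0.32   | 0.34 | 0.94            |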




Response generation and routing. Two simulation designs (denoted as Design I and Design II) were considered depending upon the routing methods (routing based on true $\theta$ vs. routing based on $\hat{\theta}$). Sample size was set at 3,000. In both designs, every simulee responded to the items in the routing module, and roughly 1/3 of the simulees were routed to each of the three target modules based on the routing rules. The 1/3 and 2/3 quantiles of the standard normal distribution were chosen as the two fixed cut points, and they are $c_1 = -.43$ and $c_2 = .43$.
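The two cut points can be reproduced from the standard normal quantile function. The snippet below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8); the variable names are chosen for this sketch.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Tertile cut points: one third of a N(0, 1) population falls in each region.
c1 = std_normal.inv_cdf(1 / 3)
c2 = std_normal.inv_cdf(2 / 3)

print(round(c1, 3), round(c2, 3))
```

The two values are symmetric about zero and round to -.43 and .43, so about one third of a standard normal population falls below $c_1$, between the cut points, and above $c_2$.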

For Design I, a group of 3,000 simulees' true $\theta$s was generated from a standard normal distribution. Then the responses were generated based on the 2PL model in Eq. (1). The next module was decided by the location of the simulees' true $\theta$s relative to the cut points. If their true $\theta$s were smaller than $c_1$, they were assigned to the easy module; if their true $\theta$s were larger than $c_2$, they were assigned to the difficult module; and if their true $\theta$s were between the two cut points, they were assigned to the medium module. This design, although unrealistic in practice, results in a missing not at random (MNAR) condition.

Design II only differs from Design I by the routing method. To reduce random error, Design II shared the same 3,000 $\theta$s and the same responses from the routing block in Design I. After the routing stage, individual $\theta$ was estimated via the expected a posteriori (EAP) method with a standard normal prior, and the next module was decided by the location of $\hat{\theta}$ relative to the cut points. This design results in a MAR condition.
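The difference between the two designs can be sketched in a short simulation: both designs share the same true abilities and routing-block responses, and differ only in whether the routing decision uses the true ability or an EAP estimate. The routing item parameters in the usage lines, the grid used for the EAP, the seed, and all function names are hypothetical choices for this sketch, not the settings of the study.

```python
import math
import random

D = 1.7


def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def eap(y, a, b, nodes, prior):
    """Posterior mean of theta given routing responses y and a N(0, 1) prior."""
    num = den = 0.0
    for t, w in zip(nodes, prior):
        like = w
        for y_j, a_j, b_j in zip(y, a, b):
            p = p_2pl(t, a_j, b_j)
            like *= p if y_j == 1 else 1.0 - p
        num += t * like
        den += like
    return num / den


def route(value, c1, c2):
    if value < c1:
        return "easy"
    if value > c2:
        return "hard"
    return "medium"


def simulate(n, a, b, c1=-0.43, c2=0.43, seed=1):
    rng = random.Random(seed)
    nodes = [-4.0 + 0.2 * k for k in range(41)]
    prior = [math.exp(-0.5 * t * t) for t in nodes]  # unnormalized; EAP is a ratio
    records = []
    for _ in range(n):
        theta = rng.gauss(0.0, 1.0)
        y = [1 if rng.random() < p_2pl(theta, a_j, b_j) else 0 for a_j, b_j in zip(a, b)]
        records.append({
            "theta": theta,
            "y": y,
            "design_I": route(theta, c1, c2),                          # true-theta routing
            "design_II": route(eap(y, a, b, nodes, prior), c1, c2),    # EAP routing
        })
    return records


sims = simulate(200, a=[0.6, 0.7, 0.5, 0.8], b=[-1.0, 0.0, 1.0, 0.5])
print(sum(s["design_I"] != s["design_II"] for s in sims), "of 200 routed differently")
```

Under Design II, two simulees with identical routing-block responses always receive the same module, which is the sense in which missingness depends only on observed data; under Design I, simulees with identical responses can be routed differently because the decision depends on the unobserved ability.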

Calibration methods. Table 3 summarizes the calibration methods used in the two simulation designs. If all items in the test are viewed as measuring a single, unidimensional trait, five


"See Eq (2.1) on page 7 of the following document 
hups://wew nag gov contenvnagb/aseets/document/publicstions/achiovementdeveloping-achievement-levels. 
201 Lenaep-sradl-prade2-witing-technical-epor. pd 


different methods were compared. They are (1) the single-group MML (denoted as S-MML hereafter), which treats the entire calibration sample as a single group with θ from a standard normal distribution; (2) the multiple-group MML with all normal (denoted as M-MML-N), where we assume the population consists of three subpopulations, all of which follow a normal distribution with group-specific mean and variance. Here the mean and variance for the middle group were fixed at their true values to fix the scale. We considered this method just to replicate the studies by Cai et al. (2018) and Lu et al. (2018); (3) the multiple-group MML with truncated normal (denoted as M-MML-T) according to the description in section 4.2; (4) the single-group fixed parameter calibration (S-FPC); and (5) the multiple-group FPC (M-FPC). With FPC, the calibration proceeds in two steps. In the first step, the complete response matrix from the routing block was calibrated as usual; then those routing item parameters were fixed at their estimated values, and the targeted block items were calibrated via FPC. By single-group, we refer to estimating the πₖ's as if they are from a single population, whereas by multiple-group, we refer to estimating the πₖ's separately for three subpopulations. It is anticipated that when the MAR assumption is satisfied with θ̂ routing, all single-group methods should outperform the multiple-group methods. On the other hand, when the MAR assumption is violated with true θ routing, the multiple-group methods should be preferred.

With respect to calibration per subscale, again both single-group and multiple-group approaches were evaluated. Within the single-group framework, we considered both the MML and FPC methods, and for each method, we considered two scenarios: the one with all routing items (i.e., our modified approach that satisfies the MAR assumption) and the one with only the routing items pertinent to the corresponding subscale (i.e., the current method). These 2 (MML vs. FPC) by 2 (All vs. Only) combinations result in four methods, denoted as S-MML-All, S-FPC-All, S-MML-Only, and S-FPC-Only hereafter. Regarding the multiple-group approach, we only considered the FPC method because it does not need to specify the distribution of θ in advance. The two variants are referred to as multiple-group FPC with only subscale-relevant routing items (M-FPC-Only) and with all routing items (M-FPC-All).

Thirty replications were conducted per condition, and two prior distributions of the item parameters (i.e., log(a) ~ N(0, 0.5²) and b ~ N(0, 2²)) under the 2PL model were specified for effective runs of the FPC method. These are the default priors used in BILOG-MG and PARSCALE (Kim, 2006, p. 357).
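
As a rough illustration of how these two priors enter a penalized likelihood, the sketch below evaluates the joint log prior density of one item's (a, b). The function name and the decision to include the Jacobian term for a are assumptions of this example; an implementation that optimizes on the log(a) scale would drop that term, and BILOG-MG or PARSCALE may differ in detail.

```python
import math


def _log_normal_pdf(x, sd):
    """Log density of N(0, sd^2) evaluated at x."""
    return -0.5 * (x / sd) ** 2 - math.log(sd * math.sqrt(2.0 * math.pi))


def log_prior_2pl(a, b, sd_log_a=0.5, sd_b=2.0):
    """Log prior of (a, b) with log(a) ~ N(0, sd_log_a^2) and b ~ N(0, sd_b^2)."""
    if a <= 0:
        raise ValueError("a must be positive")
    # The -log(a) term is the Jacobian for expressing the prior as a density in a.
    return _log_normal_pdf(math.log(a), sd_log_a) - math.log(a) + _log_normal_pdf(b, sd_b)


lp_center = log_prior_2pl(1.0, 0.0)    # item at the prior center
lp_extreme = log_prior_2pl(3.0, 4.0)   # item far from the prior center
```

Adding such a term to each item's log-likelihood pulls extreme estimates toward a = 1 and b = 0, which is why priors help stabilize runs with few responses per item.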


Table 3. Summary of the calibration methods for different simulation designs

Single (S)/      Methods                                    Notation       Simulation I:         Simulation II:
Multiple (M)                                                               unidimensional 2PL    unidimensional 2PL
group                                                                      with θ routing        with θ̂ routing

Scenario 1: All items are calibrated on a single scale
S                MML                                        S-MML          ✓                     ✓
M                MML with all normal                        M-MML-N        ✓                     ✓
M                MML with truncated normal                  M-MML-T        ✓                     ✓
S                Fixed parameter EM (FPC)                   S-FPC          ✓                     ✓
M                Fixed parameter EM (FPC)                   M-FPC          ✓                     ✓

Scenario 2: Items from different content areas (i.e., scales) are calibrated on separate scales
S                MML per subscale                           S-MML-Only                           ✓
S                Modified MML per subscale                  S-MML-All                            ✓
S                Fixed parameter EM per subscale            S-FPC-Only                           ✓
S                Modified fixed parameter EM per subscale   S-FPC-All                            ✓
M                Fixed parameter EM per subscale            M-FPC-Only                           ✓
M                Modified fixed parameter EM per subscale   M-FPC-All                            ✓


The R and MATLAB source code for running all the proposed methods can be found at https://sites.uw.edu/pmetrics/publications-and-source-code/




5.2 Results 

Overall unidimensional calibration. The evaluation criteria are the average bias and root mean squared error (RMSE) of the a- and b-parameters. They were computed first across all replications per item, and then averaged over all items. The parameter recovery was summarized both for all items and for items within each block. Table 4 presents the average bias and RMSE for Design I with true θ routing, and Table 5 presents the item parameter recovery for Design II with estimated θ routing.
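
The two-step averaging described above (across replications per item, then across items) can be written compactly. The following is a minimal sketch; the function name, the array layout, and the toy numbers are assumptions of this example rather than part of the study.

```python
import numpy as np


def recovery_summary(estimates, truth, block=None):
    """Average bias and RMSE of one parameter type (a or b).

    estimates: (n_replications, n_items) array of estimates.
    truth:     (n_items,) array of generating values.
    block:     optional (n_items,) array of block labels.
    """
    err = np.asarray(estimates, dtype=float) - np.asarray(truth, dtype=float)
    bias_item = err.mean(axis=0)                   # per item, across replications
    rmse_item = np.sqrt((err ** 2).mean(axis=0))   # per item, across replications
    out = {"All": (bias_item.mean(), rmse_item.mean())}
    if block is not None:
        block = np.asarray(block)
        for label in np.unique(block):
            sel = block == label
            out[str(label)] = (bias_item[sel].mean(), rmse_item[sel].mean())
    return out


# Toy input: 2 replications, 2 items (values are made up for illustration).
summary = recovery_summary([[1.1, 2.0], [0.9, 2.0]], [1.0, 2.0],
                           block=["Routing", "Easy"])
```

Each entry of `summary` is a (bias, RMSE) pair, matching the "All" and per-block columns of the recovery tables.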


Table 4. Average bias and RMSE of a- and b-parameters with 2PL model calibration for Design I (i.e., true θ routing)

            All          Routing      Easy         Medium       Hard
Method      a     b      a     b      a     b      a     b      a     b
Bias

SMM [-028 [025 [OT [-008 | -030 [oa [Oot [O78 [027 [Os 

MMMEN| 0.00 [0.01 [0.00] 002 |-0.05 [0.04 [0.04 [0.00 [0.02 |-0.01 

M-MML-T [0.01 [0.02 [0.04 | 0.00-|0.00 [0.08 [-0.03 | 005 [0.01 |-0.08 

S-FeC__[-027 fo21 [0.00 [0.01 |-0:30 [0.45 | -0.55 [0.49 | 0.28 [-0.09 | 

M-FPC_|-0.05 [0.03 [0.00 [001 [-0.05 [0.05 [-0.10 [o.07 [-005 [oor 
RMSE

SMM_[030 [055 [oor [008 [ost [oss [ows [rsa poxs [oI | 

MMMEN [012 [0.08 [0.03 [005 [0.14 [0.10 [020 [0.11 [ott 005] 

MMML-T [0-14 [0.12 [0.05 [0.08 [0:16 [0.14 0.24 [0.20 [0.12 | 009 

SFeC___[029 [ost [00s [007 [os2 [oar |os6 [059 ]029 [0.15 

M-rPC_[0.13 [0.10 [0.04 [007 [0-15 [0.12 [022 [0.15 [0.12 [006 


Table 5. Average bias and RMSE of a- and b-parameters with 2PL model calibration for Design II (i.e., estimated θ routing)

            All          Routing      Easy         Medium       Hard
Method      a     b      a     b      a     b      a     b      a     b
Bias

SMM [oor [or [or aos [0m [ons oo [a [oo [oe 

MMMEN [0.10 [001-026 [002 [-0.10]-095 [oor [007 [0.10034 

MMMLT [035 [010 [0.05 [o.00_[0.35 |-0.18 [079 [0.20 [O26 [0.03 


SFPC[-0.02]0.02 [0.00 [0.01 ]-0.02 [0.03 [-0.02 [0.02 [-0.04] 0.01 
merec [oat [-0.10 [0.00 [0.01 [0.39 [-0.32 [1.01 [-0.2i fost [0.10 
RMSE 


EMME [0.10 [008 [O03 [ous [0.12 [OOs [O14 [009 [OMT 
M-MML-N [0.16 [0.63 [0.26 [0.64 [0.13 [as [0.14 [007 [0.13 [094 
M-MML-T [0.38 [0.15 [0.07 [0.03 [0.39 [0.20 [0.83 [0.23 [0.29 [0.08 
S-FPC [0.10 [0.08 [0.04 [0.07 [o.11 [0.09 [0.13 [0.08 [0.11 [0.06 
MFPC [0.44 [0.19 [0.08 [0.07 [os [oss [1.04 [0.25 [oad [0.13 




Several conclusions can be drawn from Tables 4 and 5. First and unsurprisingly, the MML and FPC methods, including both their single-group and multiple-group versions, perform similarly in all conditions. That is, when S-MML performs well, S-FPC also performs well. In contrast, when M-MML (both M-MML-N and M-MML-T) performs badly, so does M-FPC. This indicates that one can either concurrently calibrate all items or calibrate routing items first and targeted block items second. Second and more interestingly, when routing is based on true θ, the multiple-group approach outperforms the single-group approach regardless of the specific calibration method. In this case, the MAR assumption is violated, and using a single-group approach based on observed data ignores the missing data mechanism. As a result, the item parameter estimates are severely biased. On the other hand, when routing is based on estimated θ, the single-group approach performs much better than the multiple-group approach. In the latter case, the items in the routing block are still recovered well; it is the targeted blocks that are adversely affected.


Calibration per subscale. Although the items are calibrated separately per subscale, the same evaluation criteria were still used to summarize the parameter recovery. In this case, only simulation Design II was considered because it closely mimics real practice. Table 6 reports the results for Design II.


Table 6. Average bias and RMSE of a- and b-parameters with 2PL model calibration per subscale for Design II (i.e., estimated θ routing)

            All          Routing      Easy         Medium       Hard
Method      a     b      a     b      a     b      a     b      a     b
Bias
SMML-Only [035 [025 [-004 [000 [-039 [Ore [067 [oes [034 [0s 
SMML-All_[0.01 [0.00-[0.00 [0.00 [0.02 [0.00 [0.01 0.00 [0.01 [-0.01 
S-FPC-Only|-0.29[0.19 [0.00 [001 [-0.34 [ori [054 [0.40 [-0.30 |-0.35 
S-FPCAI —|-0.01 [0.02 [0.00 0.01 |-0.01 [0.02 |-0.02 [0.02 |-0.03 [0.01 
M-FPC-Only [0.20 [-0.08 [0.00 [0.01 |024 [-027 044 |-0.16 [0.16 [0.11 
M-FPC-AN [0.73 [-0.13 [0.00 [0.01 [0.70 [-0.38 [1.95 | -0.27 [0.39 [0.09 




RMSE 
SMMLOnly [036 [OSS [0.06 [oo [040 [7s [OST [00 [036 [037 
SMML-All [0.12 [0.08 [0.04 [0.07 [0.15 [009 [os [0.10 [0.12 [0.06 
S-FPC-Only [0.31 [042 [0.04 [0.07] 0.35 [0.72 [055 [oa [032 [0.39 
SFPCAN [0.11 [0.08 [0.04 [0.07 [0.13 [0.09 [0.16 [0.09 [0.13 [0.07 
M-FPC-Only [0.29 [0.17 [0.04 [0.07 [0.34 [029 [055 [0.19 [025 [0.14 
M-FPC-AN [0.79 [023 [0.04 [0070.77 [oa9 [205 [0.33 [043 [0.13 


Table 6 shows that, consistent with prior findings (e.g., Lu et al., 2017), using a single-group MML or a single-group FPC per subscale calibration leads to severe bias. This is due to the violation of the MAR assumption. The modified approach, however, by augmenting the subscale item responses with responses on all routing items, helps satisfy the MAR assumption. Therefore, as expected, the modified approach greatly improves estimation accuracy. Both S-MML-All and S-FPC-All result in almost unbiased parameter estimates. Another interesting finding worth mentioning is that, when the MAR assumption is violated, the multiple-group approach outperforms the single-group approach. This is reflected in the better results from M-FPC-Only than from S-FPC-Only, although M-FPC-Only still yields relatively large bias and RMSE. One explanation is that the number of items per subscale per block (see Table 1) is too small to help recover the underlying θ distribution per group in the M-FPC-Only approach. Further simulation studies need to be conducted to verify this conjecture.
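
The difference between the "Only" and "All" variants comes down to which response columns enter the calibration of a given subscale. The sketch below illustrates that selection; the function name, the item layout, and the subscale labels are hypothetical, and how the added routing items are parameterized during calibration is left to the calibration routine itself.

```python
import numpy as np


def calibration_columns(subscale, is_routing, target, use_all_routing):
    """Column indices of the items used to calibrate one subscale.

    use_all_routing=False keeps only the target subscale's items ("Only");
    use_all_routing=True also keeps every routing item ("All").
    """
    subscale = np.asarray(subscale)
    is_routing = np.asarray(is_routing, dtype=bool)
    keep = subscale == target
    if use_all_routing:
        keep = keep | is_routing
    return np.flatnonzero(keep)


# Hypothetical 8-item test: items 0-3 form the routing block.
subscale = ["algebra", "algebra", "geometry", "geometry",
            "algebra", "geometry", "algebra", "geometry"]
is_routing = [True, True, True, True, False, False, False, False]

cols_only = calibration_columns(subscale, is_routing, "algebra", use_all_routing=False)
cols_all = calibration_columns(subscale, is_routing, "algebra", use_all_routing=True)
```

Here `cols_all` additionally contains the two geometry routing items, so the data used for calibration include every response that the routing decision was based on.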


6. Real Data Analysis 
The real response data from a special NAEP MST grade 8 math assessment study in year 201 is used as an example. The total sample size is 8,401, in which about 40% of the students (N = 3,344) were placed in the experiment sample (taking the two-stage MST, see Figure 1), and roughly 60% (N = 5,057) were in the calibration sample (random routing). In the routing stage, there are two parallel forms, and examinees were randomly assigned to one of the two forms; hence the missing data in the routing stage are missing completely at random. Table 7 presents the sample size per form and per target block from each sample. As shown, the sample sizes are comparable across different forms/blocks, and the sample size is large enough to calibrate the 2PL model parameters accurately. Table 8 presents the number of items per content domain within each form/block.

Table 7. Sample size per form/block

Routing form           Form 1                     Form 2
Target block           Easy    Medium   Hard      Easy    Medium   Hard      Total
Experiment sample       669      715     273       681      734     272      3,344
Calibration sample      847      826     868       857      848     811      5,057
Total                 1,516    1,541   1,141     1,538    1,582   1,083      8,401


Table 8. Number of items per content domain in each form/block

                                             Routing              Target
                                             Form 1    Form 2     Easy    Medium    Hard
Number properties and operations [3 42 2 ca 
Measurement 3 3 [22 2_] 
Geometry 3 33 _|3 3_]
Data analysis, statistics, and probability [2 Pee ee 2
Algebra a 3 [ss 3 


For both samples, two scenarios were considered, i.e., items from the entire test were calibrated on a single scale (labeled as "overall calibration" in Table 9) and items from each content area were calibrated on separate scales (labeled as "calibration per subscale"). For overall calibration, four methods are compared. They are the single-group marginal maximum likelihood estimation (S-MML), single-group EM (S-EM), single-group fixed parameter calibration (S-FPC), and multiple-group FPC (M-FPC). The multiple-group MML is not considered because the FPC method is more flexible in modeling the different shapes of distributions per group. It is expected that all four approaches will produce similar item parameter estimates when the data come from the calibration sample, whereas M-FPC will produce biased item parameter estimates when the data come from the experiment sample.

For the calibration per subscale, which is more interesting, two approaches are compared, as shown in Table 9. They are both single-group methods because the multiple-group alternatives did not produce satisfactory results according to the simulation findings. Again, both methods should work reasonably well on the calibration sample, whereas only the S-FPC-All method is expected to produce comparable and almost unbiased item parameters using the experiment sample. The S-MML-Only and the S-MML-All methods evaluated in the simulation study are no longer considered here for two reasons: (1) both of them perform similarly to the FPC alternatives when the population distribution of θ is normal; and (2) the distribution of θ in the current sample departs slightly from normal (see Figures 3 and 4), and hence FPC is preferred.

Table 9. Calibration plan for the real data

                                  Overall calibration                 Calibration per subscale
Calibration/Experiment sample     S-MML    S-EM    S-FPC    M-FPC     S-FPC-Only    S-FPC-All


6.1 Overall calibration results 
Figures 3 and 4 present the scatter plots of the estimated item a- and b-parameters from pairs of methods for the two samples, respectively. In both figures, the single-group EM method serves as the benchmark method because, as discussed earlier, the missing data in this scenario could be considered MAR. As shown in Figure 3, the item parameter estimates from all four methods align well when the data are from the calibration sample. Note that S-EM, S-FPC, and M-FPC allow flexible (non-parametric) θ distributions, whereas S-MML implicitly assumes a normal θ distribution. There is a slight misalignment between the estimated a-parameters from S-MML and the estimates from S-EM, which implies that the θ distribution in the calibration sample does not strictly follow a normal distribution. This misalignment is exacerbated in the experiment sample. Moreover, in Figure 4 both S-EM and S-FPC produce similar parameter estimates, whereas the item parameter estimates from M-FPC do not align well. This is consistent with the simulation findings. In addition, because the θ distributions in the calibration and experiment samples do not seem to be the same, the item parameter estimates from these two samples may not be directly comparable, resulting in a slight misalignment in Figure 5. Given this observation, the comparison between the two samples will be dropped from further discussion.


[Figure 3 image; panel titles: "Calibration sample: a-parameter" and "Calibration sample: b-parameter"]

Figure 3. Scatter plots of the estimated item a- and b-parameters from overall calibration for the calibration sample


[Figure 4 image; panel title: "Experiment sample: a-parameter"; axis label: S-MML]

Figure 4. Scatter plots of the estimated item a- and b-parameters from overall calibration for the experiment sample


Figure 5. Scatter plots of the estimated item a- and b-parameters from calibration vs. experiment sample using the S-EM algorithm


6.2 Calibration per subscale results



This section includes the results of calibrating items from different content areas on their respective scales. Figure 6 presents the results for the calibration sample, comparing the S-FPC-Only and S-FPC-All methods against S-EM, which is again the benchmark. Similar findings emerge. That is, the S-FPC-All method generates item parameter estimates that are in closer alignment with the EM approach, whereas the S-FPC-Only approach produces biased item parameter estimates. The biases are much more extreme when evaluating the results from the experiment sample, as reflected in Figure 7. This observation further reinforces that our proposed S-FPC-All approach should be preferred to the original S-FPC-Only approach because it reinstates the MAR assumption.


Figure 6. Scatter plots of the estimated unidimensional item a- and b-parameters from the calibration sample


[Figure 7 image; axis labels: S-FPC-Only, S-FPC-All, S-EM]

Figure 7. Scatter plots of the estimated unidimensional item a- and b-parameters from the experiment sample
7. Discussion 

Multistage testing design has recently emerged as a powerful test delivery mode because it can help measure the high-achieving and low-achieving subgroups more accurately than the traditional linear forms (Yan, von Davier, & Lewis, 2014). On the other hand, compared to computerized adaptive testing that is fully adaptive at the item level, MST contains pre-assembled forms such that the various constraints in the test blueprint can be checked in advance. In practice, if the same items are used over a long period of time, the parameters of those items are often recalibrated to check potential parameter drift. Item calibration is an important step in any IRT-based scoring and inference. Any biases introduced in item calibration will propagate in subsequent steps and consequently bias the conclusions that may have profound policy relevance. Only when the item parameters are precisely calibrated and linked across years can long-term trend lines be constructed and subgroup comparisons made.

Questions remain as to how to calibrate items using the incomplete data from the MST design. Complication arises when there are multiple correlated subscales per assessment, and when it is necessary to put item parameters on their respective subscale score reporting metric. Although several recent studies have started to explore various item calibration methods with the MST design (e.g., Lu et al., 2017, 2018; Cai et al., 2018; Jewsbury & van Rijn, 2018), they have not thoroughly analyzed the MST calibration challenge from a missing data perspective. For example, Lu et al. (2018) tried to provide different priors on a- and b-parameters to bring down the estimation bias, but there was not much success. Therefore, it remains unclear why the multiple-group EM does not produce an acceptable parameter recovery. In addition, a viable method is needed to properly calibrate item parameters per subscale.

In this paper, we draw upon Rubin's (1976) missing data theory and explicitly show that when the routing decision is based on θ̂, the ignorability condition (i.e., the MAR and distinctiveness assumptions) is satisfied such that the as-usual, single-group calibration methods are sufficient. Using a multiple-group approach, however, will introduce additional bias regardless of the actual calibration methods. On the other hand, when the MAR assumption is violated, as in the true θ routing condition, the multiple-group approach is necessary. As an additional check, Table 10 presents the "misclassification" rate from the simulation Design II. The true group membership is based on comparing an individual's true θ to the two cutoffs, whereas the assigned group membership is based on the estimated θ̂. Although there is only about 20% discrepancy on average, the same calibration method can perform drastically differently in the two scenarios (true θ vs. estimated θ routing), as reflected by the results in Tables 4 and 5. This reinforces the importance of checking the MAR assumption. In fact, prior studies (e.g., Mislevy & Wu, 1996; Glas, 2010; Eggen & Verhelst, 2011) have concluded that the MAR assumption is satisfied for MST design when the focus is on θ estimation given known item parameters (Mislevy & Sheehan, 1989), or on item calibration (Glas, 2010; Eggen & Verhelst, 2011). Following this perspective, we propose a simple yet effective method to resolve the calibration-by-subscale challenge. The key is to augment the response data such that the MAR assumption is satisfied.


Table 10. Misclassification rate from the simulation Design II

True group          Assigned group (based on θ̂)
(based on θ)        Easy      Medium    Difficult
Easy                .830      .166      .002
Medium              .140      .725      .135
Difficult           0         .157      .845
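
A table of this kind can be produced by cross-tabulating the group implied by the true θ against the group implied by θ̂ and normalizing each row. The sketch below is an assumption-laden illustration: the function name and the six toy values are made up, and only the cut points ±.43 come from the simulation design.

```python
import numpy as np


def classification_table(theta_true, theta_hat, c1=-0.43, c2=0.43):
    """Row-normalized 3x3 table: rows = true group, columns = assigned group."""
    true_group = np.digitize(theta_true, [c1, c2])   # 0 easy, 1 medium, 2 difficult
    assigned = np.digitize(theta_hat, [c1, c2])
    counts = np.zeros((3, 3))
    np.add.at(counts, (true_group, assigned), 1)     # accumulate pair counts
    return counts / counts.sum(axis=1, keepdims=True)


# Toy values for illustration only.
theta_true = np.array([-1.0, -1.0, 0.0, 0.0, 1.0, 1.0])
theta_hat = np.array([-1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
table = classification_table(theta_true, theta_hat)
```

The diagonal of `table` gives the agreement rate per true group, and one minus the average diagonal gives the overall discrepancy (assuming every true group is non-empty).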


In this paper, three mainstream calibration methods are reviewed and discussed in the context of missing data; they are MML, EM, and FPC. While MML often assumes θ follows a normal distribution or other known parametric distributions, the EM algorithm can naturally handle the case when the parametric form of the θ distribution is unknown. This is because it directly estimates the probability mass function of θ by treating it as a discrete random variable. This feature is extremely useful, in particular within the FPC framework, because when certain item parameters are fixed, the entire θ distribution can be freely estimated. For instance, in the simulation Design I when routing is based on true θ, both the multiple-group MML with normal (M-MML-N) and multiple-group MML with truncated normal (M-MML-T) methods assume the shape of the θ distribution per group is known, whereas M-FPC estimates the shape of the distribution per group freely. Despite these differences, the three methods, in both their single-group and multiple-group versions, all perform similarly, and hence they can be used interchangeably whenever the situation allows. Last but not least, in addition to the proposed new item calibration method, the challenge could also potentially be resolved by using a multidimensional IRT (MIRT) calibration. This is because MIRT calibration also takes into account all item responses in the routing block simultaneously. Future studies could compare the MIRT calibration with the several methods considered herein.
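
To make the idea of freely estimating the θ distribution concrete, the sketch below runs EM updates of a discrete θ distribution while all item parameters are held fixed, using only the administered items of each person. It is a simplified illustration under stated assumptions (function name, grid of 41 nodes, D = 1.7, 50 iterations, simulated data), not the calibration code used in the study.

```python
import numpy as np


def update_theta_weights(y, a, b, nodes, weights, n_iter=50, D=1.7):
    """EM estimate of P(theta = node) with item parameters fixed.

    y: (n_persons, n_items) 0/1 responses, np.nan where an item was not administered.
    """
    obs = (~np.isnan(y)).astype(float)
    y0 = np.where(np.isnan(y), 0.0, y)
    p = 1.0 / (1.0 + np.exp(-D * a * (nodes[:, None] - b)))        # (n_nodes, n_items)
    # Log-likelihood of the observed responses at every node.
    loglik = y0 @ np.log(p).T + (obs - y0) @ np.log(1.0 - p).T     # (n_persons, n_nodes)
    lik = np.exp(loglik)
    for _ in range(n_iter):
        post = lik * weights
        post /= post.sum(axis=1, keepdims=True)    # E-step: posterior over nodes
        weights = post.mean(axis=0)                # M-step: average posterior mass
    return weights


# Simulated check with hypothetical item parameters and a non-standard theta mean.
rng = np.random.default_rng(1)
n_persons, n_items = 2000, 30
a = np.full(n_items, 1.0)
b = np.linspace(-2.0, 2.0, n_items)
theta = rng.normal(0.5, 1.0, n_persons)
y = (rng.random((n_persons, n_items))
     < 1.0 / (1.0 + np.exp(-1.7 * a * (theta[:, None] - b)))).astype(float)
y[rng.random((n_persons, n_items)) < 0.2] = np.nan   # items not administered
nodes = np.linspace(-4.0, 4.0, 41)
w = update_theta_weights(y, a, b, nodes, np.full(41, 1.0 / 41))
est_mean = float(nodes @ w)
```

Because no parametric shape is imposed on `w`, the estimated distribution can shift or skew to match the calibration sample, which is the flexibility referred to above.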




References 

Ban, J.-C., Hanson, B. H., Wang, T. Y., Yi, Q., & Harris, D. J. (2001). A comparative study of on-line pretest item-calibration/scaling methods in computerized adaptive testing. Journal of Educational Measurement, 38, 191-212.

Beaton, A. E., & Zwick, R. (1992). Overview of the national assessment of educational progress. Journal of Educational Statistics, 17, 95-109.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.

Cai, L. (2008). A Metropolis-Hastings Robbins-Monro algorithm for maximum likelihood nonlinear latent structure analysis with a comprehensive measurement model. Unpublished doctoral dissertation, Department of Psychology, University of North Carolina at Chapel Hill.

Cai, L., Roussos, L., & Wang, X. (2018). Comparison of calibration and drift detection methods under multistage testing. Paper presented at the NCME annual meeting, New York City, NY.

Cai, L., Yang, J. S., & Hansen, M. (2011). Generalized full-information item bifactor analysis. Psychological Methods, 16, 221-248.

Chang, H.-H., & Ying, Z. L. (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20, 213-229.

Chang, H., & Ying, Z. (2008). To weight or not to weight? Balancing influence of initial items in adaptive testing. Psychometrika, 73, 441-450.

Chen, P., & Wang, C. (2016). A new online calibration method for multidimensional computerized adaptive testing. Psychometrika, 81, 674-701.


Chen, P., Wang, C., Xin, T., & Chang, H.-H. (2017). Developing new online calibration methods for multidimensional computerized adaptive testing. British Journal of Mathematical and Statistical Psychology, 70, 81-117.

Dean, V., & Martineau, J. (2012). A state perspective on enhancing assessment and accountability systems through systematic implementation of technology. In Lissitz, R. W., & Jiao, H. (Eds.), Computers and their impact on state assessment: Recent history and predictions for the future (pp. 55-77). Charlotte, NC: Information Age.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.

Eggen, T. J. H. M., & Verhelst, N. D. (2011). Item calibration in incomplete designs. Psicológica, 32, 107-132.

Glas, C. A. W. (2010). Item parameter estimation and item fit analysis. In W. J. van der Linden & C. A. W. Glas (Eds.), Elements of adaptive testing (pp. 269-288). New York: Springer.

Guo, R., Zheng, Y., & Chang, H.-H. (2015). A stepwise test characteristic curve method to detect item parameter drift. Journal of Educational Measurement, 52, 280-300.

Han, K. T., & Guo, F. (2014). Impact of violation of the missing-at-random assumption on full-information maximum likelihood method in multidimensional adaptive testing. Practical Assessment, Research & Evaluation, 19(2). Available online: http://pareonline.net/getvn.asp?v=19&n=2

Jewsbury, P., & van Rijn, P. (2018). Random missing in multidimensional multistage testing: The importance of multivariate latent variable models. Paper presented at the NCME annual meeting, New York City, NY.


Kim, S. (2006). A comparative study of IRT fixed parameter calibration methods. Journal of Educational Measurement, 43, 355-381.

Kim, S., & Kolen, M. (2016). Multiple group IRT fixed-parameter estimation for maintaining an established ability scale. Center for Advanced Studies in Measurement and Assessment Report #49. https://education.uiowa.edu/sites/education.uiowa.edu/files/documents/centers/casma/casma-research-report-49.pdf

Lissitz, B., Jiao, H., Li, M., Lee, D., & Kang, Y. (2014). Software packages for multiple group IRT analysis and accuracy of parameter estimates. Executive Report for the Maryland State Department of Education.

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: John Wiley & Sons.

Liu, O., Wilson, M., & Paek, I. (2008). A multidimensional Rasch analysis of gender differences in PISA Mathematics. Journal of Applied Measurement, 9, 18-36.

Lu, R., Jia, Y., & Wu, M. (2018). Using design information in item parameter estimation with multistage testing. Paper presented at the NCME annual meeting, New York City, NY.

Lu, R., Jia, Y., & Wu, M. (2017). Population definition and identification, priors, and non-random samples. Paper presented at the NCME annual meeting, San Antonio, TX.

Mislevy, R. J. (1984). Estimating latent distributions. Psychometrika, 49, 359-381.

Mislevy, R. J. (1991). Randomization-based inference about latent variables from complex samples. Psychometrika, 56, 177-196.


Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29, 133-161.

Mislevy, R. J., & Sheehan, K. M. (1989). The role of collateral information about examinees in item parameter estimation. Psychometrika, 54, 661-680.

Mislevy, R. J., & Wu, P.-K. (1996). Inferring examinee ability when some item responses are missing. Research Report RR-96-30-ONR. Princeton: Educational Testing Service.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581-592.

Rubin, D. B. (1991). EM and beyond. Psychometrika, 56, 241-254.

Rubin, D. B., & Thayer, D. T. (1982). EM algorithms for ML factor analysis. Psychometrika, 47, 69-76.

Swaminathan, H., & Gifford, J. A. (1986). Bayesian estimation in the three-parameter logistic model. Psychometrika, 51, 589-601.

Thissen, D., & Wainer, H. (1982). Some standard errors in item response theory. Psychometrika, 47, 397-412.

Van Groen, M. (2017). Multistage testing with multiple subjects. Invited talk at the International Association of Computerized Adaptive Testing (IACAT), Niigata, Japan.

Wang, C., Zheng, Y., & Chang, H. (2014). Does standard deviation matter? Using "standard deviation" to quantify security of multistage testing. Psychometrika, 79, 154-174.

Wainer, H. (1990). Computerized adaptive testing: A primer. Hillsdale: Erlbaum.

Weiss, D. J. (1982). Improving measurement quality and efficiency with adaptive testing. Applied Psychological Measurement, 6, 473-492.

Woodruff, D. J., & Hanson, B. A. (1996). Estimation of item response models using the EM algorithm for finite mixtures (ACT Research Report 96-6). Iowa City, IA: ACT, Inc.

Wu, M., & Lu, R. (2017). Multi-stage testing simulation studies. Paper presented at the NCME annual meeting, San Antonio, TX.

Wu, M., & Xi, N. (2017). Multi-stage testing in the 2015 NAEP mathematics DBA field trial. Paper presented at the NCME annual meeting, San Antonio, TX.

Yan, D. L., von Davier, A. A., & Lewis, C. (2014). Computerized multistage testing: Theory and applications. NY: CRC Press.


Appendix 


In this Appendix, we provide derivations showing that when the MAR assumption is satisfied (i.e., routing based on θ̂), the EM algorithm introduced in section 3.2 can also proceed based solely on the observed data.


Specifically, for the E-step, we can write the conditional expectation as follows, 


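\[
\begin{aligned}
&E_{(Y_{mis},\theta)\mid Y_{obs},\Delta^{(r)},\pi^{(r)}}\Big(\log L(\Delta,\pi\mid Y,\theta,m)\Big)\\
&\quad= E_{(Y_{mis},\theta)\mid Y_{obs},\Delta^{(r)},\pi^{(r)}}\Big(\sum_{i=1}^{N}\log\big(L(\Delta,\pi\mid y_{i,obs},y_{i,mis},\theta_i)\times h_{\phi}(m_i\mid y_{i,obs},y_{i,mis},\theta_i)\big)\Big)\\
&\quad= \sum_{i=1}^{N}E_{(y_{i,mis},\theta_i)\mid y_{i,obs},\Delta^{(r)},\pi^{(r)}}\Big(\log\big(L(\theta_i,\Delta\mid y_{i,obs})\,L(\theta_i,\Delta\mid y_{i,mis})\times g(\theta_i\mid\pi)\times h_{\phi}(m_i\mid y_{i,obs})\big)\Big)\\
&\quad= \sum_{i=1}^{N}\sum_{k}\log\big(L(\theta_k,\Delta\mid y_{i,obs})\times\pi_k\times h_{\phi}(m_i\mid y_{i,obs})\big)\times P(\theta_k\mid y_{i,obs},\Delta^{(r)},\pi^{(r)})\\
&\qquad+ \sum_{i=1}^{N}\sum_{k}E_{y_{i,mis}\mid y_{i,obs},\Delta^{(r)}}\Big(\log L(\theta_k,\Delta\mid y_{i,mis})\Big)\times P(\theta_k\mid y_{i,obs},\Delta^{(r)},\pi^{(r)})\\
&\quad= \sum_{i=1}^{N}\sum_{k}\log\big(L(\theta_k,\Delta\mid y_{i,obs})\,\pi_k\big)\times P(\theta_k\mid y_{i,obs},\Delta^{(r)},\pi^{(r)})\\
&\qquad+ \sum_{i=1}^{N}\sum_{k}\log\big(h_{\phi}(m_i\mid y_{i,obs})\big)\times P(\theta_k\mid y_{i,obs},\Delta^{(r)},\pi^{(r)})\\
&\qquad+ \sum_{i=1}^{N}\sum_{k}E_{y_{i,mis}\mid y_{i,obs},\Delta^{(r)}}\Big(\log L(\theta_k,\Delta\mid y_{i,mis})\Big)\times P(\theta_k\mid y_{i,obs},\Delta^{(r)},\pi^{(r)})
\end{aligned}
\tag{A1}
\]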
The second-to-last equality holds because the expectation $E_{(y_{i,mis},\theta_i)\mid y_{i,obs},\Delta^{(r)},\pi^{(r)}}$ is actually a double integral, one with respect to the distribution of $y_{i,mis}$ and the other with respect to the distribution of $\theta$. The first term in this equality is irrelevant to $y_{i,mis}$; hence it can be taken out of the expectation with respect to $y_{i,mis}$, resulting in only one integral, which is written as a summation.

In the last equality in Eq. (A1), the first term is simply the conditional expectation of the log-likelihood based on observed data (i.e., the same as Eq. 5), the second term is irrelevant to the target parameters, whereas the third term actually vanishes in the M-step. The explanation is as follows. Without loss of generality, take item j as an example. Taking the first-order derivative of the third term with respect to $\Delta_j$, we have

\[
\frac{\partial\Big[E_{y_{i,mis}\mid y_{i,obs},\Delta^{(r)}}\big(\log L(\theta_k,\Delta\mid y_{i,mis})\big)\times P(\theta_k\mid y_{i,obs},\Delta^{(r)},\pi^{(r)})\Big]}{\partial\Delta_j}
= E_{y_{i(-j),mis}}\!\left[E_{y_{ij,mis}}\!\left(\frac{\partial\log L(\theta_k,\Delta\mid y_{ij,mis})}{\partial\Delta_j}\right)\right]\times P(\theta_k\mid y_{i,obs},\Delta^{(r)},\pi^{(r)})
\tag{A2}
\]

where $y_{ij,mis}$ denotes the missing response of person i on item j, and $y_{i(-j),mis}$ denotes the missing responses of person i on the remaining items except item j. Equation (A2) holds because (1) $P(\theta_k\mid y_{i,obs},\Delta^{(r)},\pi^{(r)})$ is irrelevant to the distribution of $y_{i,mis}$ and hence it can be taken outside the expectation; and (2) due to the discreteness of $y_{i,mis}$, the expectation with respect to the posterior distribution of $y_{i,mis}$ can be expanded as a series of expectations.




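Because

\[
L(\theta_k,\Delta\mid y_{ij,mis}) = P(y_{ij,mis}=1\mid\theta_k,\Delta)^{y_{ij,mis}}\big(1-P(y_{ij,mis}=1\mid\theta_k,\Delta)\big)^{1-y_{ij,mis}},
\]

consider item parameter $a_j$ as an example; then

\[
\frac{\partial\log L(\theta_k,\Delta\mid y_{ij,mis})}{\partial a_j} = 1.7\,(\theta_k-b_j)\big(y_{ij,mis}-P(y_{ij,mis}=1\mid\theta_k,\Delta)\big).
\tag{A3}
\]

And because $y_{ij,mis}$ follows a Bernoulli distribution, it is easily shown that the expectation of (A3) with respect to the distribution of $y_{ij,mis}$, $E_{y_{ij,mis}}\big(\partial\log L(\theta_k,\Delta\mid y_{ij,mis})/\partial a_j\big)$, is 0. As a result, Eq. (A2) also becomes 0 and hence it vanishes in the M-step. Therefore, with missing data satisfying MAR, the EM algorithm can proceed in the same fashion as in section 3.2 using the observed data.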
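
The zero-expectation step can be written out explicitly. As a sketch, write $P_{jk} = P(y_{ij,mis}=1\mid\theta_k,\Delta)$ (a shorthand introduced here for brevity) and let $c_{jk}$ denote any factor that does not depend on $y_{ij,mis}$, such as the factor multiplying the residual in (A3). Since $y_{ij,mis}$ equals 1 with probability $P_{jk}$ and 0 with probability $1-P_{jk}$,

\[
E_{y_{ij,mis}}\big[c_{jk}\,(y_{ij,mis}-P_{jk})\big]
= c_{jk}\big[P_{jk}(1-P_{jk}) + (1-P_{jk})(0-P_{jk})\big]
= 0 .
\]

Under the 2PL model the derivative with respect to $b_j$ is also a multiple of the residual $y_{ij,mis}-P_{jk}$, so the same cancellation applies to it.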




